Model Selection for Curve Classification

This project aimed to provide a proof of concept for an efficient, cost-effective solution for classifying HRM (High Resolution Melt) curves. It was conducted in a statistics laboratory on the Montpellier Faculty of Pharmacy campus, under the scientific direction of Chrystelle Reynes and the supervision of SattAxlR, with Laetitia Mahe.

These curves are obtained by gradually heating DNA and observing the evolution of fluorescence intensity: the shape of the curve makes it possible to pinpoint the moment when the DNA strands separate (the melting point), which in turn allows the observed genetic material to be identified, or at least classified.
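As a loose illustration (in Python, although the project itself was carried out in R), the melting temperature can be read off a curve as the point of steepest fluorescence decrease, i.e., the most negative numerical derivative; the data below are made up.

```python
def melting_point(temps, fluo):
    """Estimate the melting temperature as the point of steepest
    fluorescence drop (most negative forward-difference derivative)."""
    derivs = [(fluo[i + 1] - fluo[i]) / (temps[i + 1] - temps[i])
              for i in range(len(temps) - 1)]
    i = min(range(len(derivs)), key=lambda j: derivs[j])
    # Report the midpoint of the steepest interval.
    return (temps[i] + temps[i + 1]) / 2

# Toy curve: fluorescence plateaus, then collapses around 84-86 °C.
temps = [80, 81, 82, 83, 84, 85, 86, 87, 88]
fluo  = [100, 99, 98, 96, 90, 60, 30, 25, 24]
print(melting_point(temps, fluo))  # -> 84.5
```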

Tasks & Objectives

The objective was to produce an algorithm capable of efficiently classifying DNA samples from 27 bacteria using their melt curves obtained through HRM. These curves, which serve as a "signature" of the observed genetic material, are precise enough to detect even minor variations.

High resolution Melt curve

The project's goal was therefore to demonstrate that these curves can characterize the different parasites in the test dataset, in order to enable industrial scaling.

Actions and Development

During the first phase of the proof of concept, samples were classified using the statistical method LDA. Although it achieved 100% accuracy, this method was evaluated only on technical replicates: a single sample per bacterial category was sequenced multiple times, so the data contained no biological variability.

During the pre-maturation phase, in which I was involved, we needed to address this issue by accounting for biological variability. The dataset was therefore expanded: for most categories, several distinct bacteria belonging to the same group were sequenced, each with multiple technical replicates.

Once biological variability was included, discrimination via LDA no longer worked at all, because this method assumes homoscedasticity of the observed variables. That assumption is fundamentally violated in our case: the variables are the 195 acquisition points of each curve (i.e., 195 (x, y) pairs) and are strongly dependent, since if two x values are close, their respective y values will be close as well.

After a couple of months of research, the random forest machine learning algorithm solved this problem. This algorithm builds many decision trees and perturbs both the training set (each tree is trained on a bootstrap sample of the data) and the variables used to build the tree (a "classical" choice is to consider only $\sqrt{p}$ randomly selected variables among the $p$ available when splitting). The final classification of an individual by the random forest is then the result of the forest's majority vote (classically 500 trees).
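To make these two sources of perturbation concrete, here is a deliberately minimal Python sketch (the project itself used R's randomForest package): each "tree" is reduced to a one-split stump, but the bootstrap sampling of rows, the random $\sqrt{p}$ feature subset, and the final majority vote are all there.

```python
import math
import random
from collections import Counter

def train_forest(X, y, n_trees=100, seed=0):
    """Bag of one-split 'stumps': each tree sees a bootstrap sample of
    the rows and a random subset of sqrt(p) of the p features."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    mtry = max(1, int(math.sqrt(p)))                  # classical sqrt(p) choice
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(p), mtry)            # perturbed feature set
        best = None
        for f in feats:                               # pick the best threshold
            for t in sorted({X[i][f] for i in rows}):
                left = [y[i] for i in rows if X[i][f] <= t]
                right = [y[i] for i in rows if X[i][f] > t]
                if not left or not right:
                    continue
                ll = Counter(left).most_common(1)[0][0]
                rl = Counter(right).most_common(1)[0][0]
                err = sum(c != ll for c in left) + sum(c != rl for c in right)
                if best is None or err < best[0]:
                    best = (err, f, t, ll, rl)
        if best:
            forest.append(best[1:])   # (feature, threshold, left/right labels)
    return forest

def predict(forest, x):
    """Final classification = majority vote over all trees in the forest."""
    votes = [ll if x[f] <= t else rl for f, t, ll, rl in forest]
    return Counter(votes).most_common(1)[0][0]

# Two classes separable on feature 0; the other features are constant noise.
X = [[0, 7, 7, 7], [1, 7, 7, 7], [2, 7, 7, 7],
     [8, 7, 7, 7], [9, 7, 7, 7], [10, 7, 7, 7]]
y = ["A", "A", "A", "B", "B", "B"]
forest = train_forest(X, y)
print(predict(forest, [1, 7, 7, 7]), predict(forest, [9, 7, 7, 7]))  # A B
```

A real forest grows full trees and re-samples the candidate features at every split; the stump above only keeps the mechanics that matter for the explanation.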

Random Forest Prediction

In particular, this sampling-based classification method can handle datasets with strongly correlated variables and high dimensionality, which is exactly our situation here.

Results

Using cross-validation methods, the final algorithm achieved success rates between 92% and 100%. The variability of these rates stemmed both from the validation method and from the randomness inherent in the random forest algorithm itself: during this phase, we did not select or save a particular model, but rather sought to validate the method.
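The exact validation protocol is not detailed here, but a generic k-fold scheme conveys the idea: each sample is held out exactly once, and the success rate is the fraction of held-out samples predicted correctly. A minimal Python sketch, with a stand-in 1-nearest-neighbour classifier for the demo (the project's actual models were random forests in R):

```python
import random

def k_fold_accuracy(X, y, train, predict, k=5, seed=0):
    """Hold each fold out once; return the overall success rate."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]             # k roughly equal folds
    correct = 0
    for held_out in folds:
        train_idx = [i for i in idx if i not in held_out]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)

# Stand-in classifier for the demo: 1-nearest-neighbour.
def train_1nn(X, y):
    return list(zip(X, y))

def predict_1nn(model, x):
    return min(model, key=lambda s: sum((a - b) ** 2
                                        for a, b in zip(s[0], x)))[1]

X = [[0], [1], [2], [10], [11], [12]]
y = ["A", "A", "A", "B", "B", "B"]
print(k_fold_accuracy(X, y, train_1nn, predict_1nn, k=3))  # -> 1.0
```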

The final algorithm complemented the initial data with numerical derivatives (adding 194 points) as well as characteristic points, such as the melting point. It also performed a recursive self-test phase to identify groups it could not separate reliably enough and merged each such set of groups into a new aggregated group. The classification algorithm was considered complete once the classification was "stable" over these (possibly merged) groups.
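The feature-augmentation step can be sketched as follows in Python (the helper names are illustrative, not the project's actual R code): a 195-point curve is extended with its 194 forward-difference derivative values and one characteristic point, here the steepest-drop temperature.

```python
def augment(temps, fluo):
    """Append the forward-difference derivative (len-1 points) and the
    estimated melting temperature to the raw fluorescence values."""
    derivs = [(fluo[i + 1] - fluo[i]) / (temps[i + 1] - temps[i])
              for i in range(len(fluo) - 1)]
    i = min(range(len(derivs)), key=lambda j: derivs[j])
    tm = (temps[i] + temps[i + 1]) / 2        # steepest-drop temperature
    return fluo + derivs + [tm]

# A 195-point curve becomes a 195 + 194 + 1 = 390-dimensional feature vector.
temps = [60 + 0.2 * i for i in range(195)]                    # toy grid
fluo = [100 / (1 + 1.05 ** (i - 100)) for i in range(195)]    # toy sigmoid decay
print(len(augment(temps, fluo)))  # -> 390
```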

About two years after the end of this contract, I received a call from Laetitia Mahe informing me that the algorithm had passed the industrial maturation phase, meaning it had proven itself with real (industrial) datasets.

Technical Stack

The project relies on the following tools and technologies:

  • Programming Language: R
  • Machine Learning Package: randomForest
  • Statistical Methods: LDA (Linear Discriminant Analysis), Cross-validation
  • Algorithms: Random Forest, Decision Trees

It is important to note that this technical stack was imposed by the statistical research context. The major technical challenges included:

  • Managing biological variability in HRM data
  • Processing datasets with strong correlation matrices
  • Developing a robust algorithm for curve classification